(57) ABSTRACT

Systems and methods for providing a recoverable real time
multi-tasking computer system are disclosed. In one embodi- 
ment, a system comprises a real time computing environment, 
wherein the real time computing environment is adapted to 
execute one or more applications and wherein each applica- 
tion is time and space partitioned. The system further com- 
prises a fault detection system adapted to detect one or more 
faults affecting the real time computing environment and a 
fault recovery system, wherein upon the detection of a fault 
the fault recovery system is adapted to restore a backup set of 
state variables. 
FAULT RECOVERY FOR REAL-TIME,
MULTI-TASKING COMPUTER SYSTEM 

GOVERNMENT LICENSE RIGHTS 

The U.S. Government may have certain rights in the 
present invention as provided for by the terms of Contract No. 
NCC-1-393 NASA CRA awarded by NASA.

TECHNICAL FIELD 

The present invention generally relates to multi-tasking 
computer platforms and more specifically to fault detection
and recovery for software applications executing in real time 
multi-tasking environments. 

BACKGROUND 


The automation of aircraft functions being implemented in
avionics systems, specifically flight critical systems, is
migrating towards real-time multi-tasking computers. Rather
than performing one aircraft function on a single computer,
multiple functions, potentially of different criticality
significance, are integrated into a single system. Flight
critical display functions, but not flight critical control
(for example, fly-by-wire) functions, have been implemented
using multi-tasking computers. Another trend is that digital
electronics built for consumer products are getting continually
smaller.

As the digital devices become smaller, it takes less energy to
corrupt those devices by placing individual bits in an
unintended state. Miniaturization has increased the susceptibility
of computer electronics and processor hardware elements to
various upset events. Miniaturization has reached the point
where atmospheric neutrons now pose a threat for corrupting
these devices, as do intense electromagnetic fields
produced by environmental events such as lightning. In the
military world, deliberate weapons that create high powered
microwave threats are also a concern. Using only commercially
available parts to build safety critical systems, it is
difficult to design computer hardware which is immune from
faults caused by these, as well as other threats.

For the reasons stated above and for other reasons stated 
below which will become apparent to those skilled in the art 
upon reading and understanding the specification, there is a
need in the art for sufficiently robust systems and methods for 
executing safety critical applications (such as those imple- 
menting fly-by-wire functions) on real-time multi-tasking 
computers that use commercially available parts. 




SUMMARY 


Embodiments of the present invention provide systems and
methods for executing safety critical applications on
real-time multi-tasking computers and will be understood by
reading and studying the following specification.

In one embodiment, a recoverable real time multi-tasking 
computer system is presented. The system comprises a real 
time computing platform, wherein the real time computing 
platform is adapted to execute one or more applications,
wherein each application is time and space partitioned. The
system further comprises a fault detection system adapted to
detect one or more faults affecting the real time computing
environment and a fault recovery system, wherein upon the
detection of a fault by the fault detection system, the fault
recovery system is adapted to restore a backup set of state 
variables. 


In another embodiment, another recoverable real time 
multi-tasking computer system is presented. The system 
comprises one or more applications and one or more proces- 
sors. The one or more processors execute the one or more 
applications, wherein each application is time and space par- 
titioned. The system further comprises one or more scratch- 
pad memories, wherein the one or more processors store state 
variables for the one or more applications in the one or more 
scratchpad memories; one or more fault monitors, the one or 
more fault monitors adapted to detect one or more system 
faults occurring during the execution of a first application of 
the one or more applications; and a fault recovery system 
adapted to duplicate state variables that are stored in the one 
or more scratchpad memories. Upon the detection of a fault, 
the one or more fault monitors are further adapted to notify the
fault recovery system, wherein the fault recovery system is 
further adapted to restore a backup set of state variables into 
the one or more scratchpad memories. The one or more pro- 
cessors are adapted to resume processing of the first applica- 
tion using the backup set of state variables. 

In another embodiment, a method for fault recovery for 
applications executing on real time multi-tasking computer 
systems wherein each application is time and space parti- 
tioned, is presented. The method comprises duplicating state 
variables for one or more computational frames; detecting a 
fault from an upset event within the computational frame in 
which the upset event occurred; and recovering state variable 
data duplicated during a computational frame prior to the 
upset event. 

In yet another embodiment, a computer-readable medium 
having program instructions for a method for fault recovery 
for applications executing on real time multi-tasking com- 
puter systems wherein each application is time and space 
partitioned is presented. The method comprises duplicating 
state variables for one or more computational frames; detect- 
ing a fault from an upset event within the computational frame 
in which the upset event occurred; and recovering state vari- 
able data duplicated during a computational frame prior to the 
upset event. 

In yet another embodiment, a rapid recovery mechanism 
for a self-checking lock-step computing lane including two or 
more processors, two or more scratchpad memories and two 
or more fault monitors, the self-checking lock-step comput- 
ing lane adapted to execute one or more applications, wherein 
each application is time and space partitioned, wherein each 
application of the one or more applications is executed by the 
two or more processors during one or more computational 
frames, wherein the two or more fault monitors are further 
adapted to detect one or more system faults within the com- 
putational frame in which the fault occurred, is presented. The 
rapid recovery mechanism comprises a first duplicate 
memory adapted to store state variables duplicated from the 
one or more scratchpad memories; and a recovery control 
logic module adapted to receive fault detection signals from 
the two or more fault monitors. Upon the detection of a fault, 
the recovery control logic module is adapted to restore a 
backup set of state variables into the two or more scratchpad 
memories. 

In still another embodiment, another recoverable real time 
multi-tasking computer system is presented. The system 
comprises means for executing two or more time and space 
partitioned software applications; means for detecting one or 
more faults affecting at least one of the two or more time and 
space partitioned software applications; and means for restor- 
ing a backup set of state variables upon the detection of a fault 
affecting the at least one of the two or more time and space 
partitioned software applications. 
DRAWINGS

The present invention can be more easily understood and 
further advantages and uses thereof more readily apparent, 
when considered in view of the detailed description and the
following figures in which: 

FIG. 1A is a time line diagram illustrating the real time 
execution of applications on real-time multi-tasking comput- 
ers of one embodiment of the present invention; 

FIG. 1B is a time line diagram illustrating an upset event
during the real time execution of applications on real-time 
multi-tasking computers of one embodiment of the present 
invention; 

FIG. 1C is a time line diagram illustrating fault detection 
and state variable recovery of one embodiment of the present
invention; 

FIG. 2 is a block diagram illustrating a fault recovery 
system of one embodiment of the present invention; 

FIG. 3 is a block diagram illustrating another fault recovery 
system of one embodiment of the present invention; and

FIG. 4 is a flow diagram illustrating a method of fault 
recovery of one embodiment of the present invention. 

In accordance with common practice, the various
described features are not drawn to scale but are drawn to
emphasize features relevant to the present invention.
Reference characters denote like elements throughout Figures and
text.

DETAILED DESCRIPTION 


Fast fault recovery is important in safety critical systems, 
such as avionic computer systems, which perform real time 
computations necessary to control or stabilize dynamic sys- 
tems, such as aircraft in flight. Embodiments of the present 
invention increase a computer system’s tolerance for faults by
providing methods and systems that allow a very fast recovery
from system faults.

Embodiments of the present invention have three elements. 
The first element involves a real time computing environment 
utilizing time and space partitioning. The second element
provides fault detection. The third element provides fault 
recovery. 

Computer systems implementing time and space partitioning
are adept at supporting real time computing recovery
capabilities. As provided by embodiments of the present
invention, time and space partitioning when combined with
state variable recovery provides a higher level of
computational integrity than either achieves independently.

Real Time Computing Environment. Embodiments of the 
present invention employ high integrity computer systems
utilizing time and space partitioning which allows hosting of
multiple pieces of software on a single piece of hardware.
Each piece of software is resident in hardware and can
perform a multitude of computational functions including but
not limited to operating systems, monitoring systems, and
application programs. 

Embodiments of the present invention can be used in safety 
critical applications such as a primary flight control applica- 
tion that must robustly execute in real time. Safety critical 
applications, such as a primary flight control application,
must execute in real time to maintain the stability and control 
of an aircraft in flight and during landing. Typically, real time 
systems are designed to control physical devices (e.g. valves, 
servos, motors, heaters) that require timely processing to 
perform their designated task correctly. As used in this
application, real time execution of applications refers to a
computer system performing calculations at the current time
based on current parameters. In one embodiment, current
parameters include current inputs from sensors. A multi-task- 
ing computer system is a computer system adapted to perform 
multiple tasks, also known as processes, using shared com- 
mon processing resources. A multi-tasking computer system 
is adapted to execute two or more software applications 
simultaneously by scheduling computer processing resources 
between the two or more software applications. In one 
embodiment of the present invention, a multi-tasking com- 
puter system is adapted to schedule computer processing 
resources to support execution of at least one application in 
real time. 

Embodiments of the present invention employ high integ- 
rity processing systems utilizing space partitioning. Accord- 
ingly, when multiple pieces of software are executed by a 
single hardware platform, it is problematic if the operation of 
one piece of software contaminates the operation of another 
piece of software running on the same platform. Thus when 
the same hardware platform is used to run both safety critical 
applications and other applications, care must be taken to 
prevent the contamination of a safety critical application by 
any other application. 

Computer systems implementing time and space partition- 
ing are adept at supporting real time computing recovery 
capabilities. Time and space partitioning of processor 
resources guarantees that one application will not corrupt the 
memory or execution space of any other application run in 
computational frames before or after it. No application can 
corrupt the timeline such that the application would overrun 
its processing time thus starving out the next application 
running in the next computational frame. As used in this 
application, the term computer system includes those ele- 
ments of an overall system that perform processing or com- 
putational functions for the overall system. In one embodi- 
ment, the computer system is a subsystem integrated into the 
overall system. 

FIG. 1A illustrates a normal execution timeline in a real
time computing environment of one embodiment of the
present invention. In the example illustrated in FIG. 1A, a
single hardware platform is executing multiple applications.
The processor cycles through each computational frame,
executing applications only within their designated
computational frame. For example, the processor executes
Application 1 in computational frame 1-a in order to perform
computations resulting in a set of state variables N. The processor
then switches to executing Applications 2, 3 and 4 in
computational frames 2-a, 3-a and 4-a respectively, each
producing their own sets of state variables N. Application 1 is again
executed in frame 1-b to perform its next frame of
computation resulting in the set of state variables N+1. FIG. 1A
illustrates a multi-tasking hardware platform utilizing time
and space partitioning. That is, each application is executed
only during its own computational frame and separately
stores state variables relevant to its computations.
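
By way of illustration only, the partitioned timeline described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the disclosed implementation: the names Partition and run_major_cycles are hypothetical, each application's state is modeled as a private dictionary, and the enforcement of per-frame time budgets that a real partitioned platform would provide is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """One time and space partitioned application (hypothetical model)."""
    name: str
    state: dict = field(default_factory=lambda: {"frame_count": 0})

    def run_frame(self) -> None:
        # Space partitioning: the application touches only its own state.
        self.state["frame_count"] += 1


def run_major_cycles(partitions, cycles):
    """Time partitioning: run each partition once per cycle, in a fixed order."""
    timeline = []
    for cycle in range(cycles):
        label = chr(ord("a") + cycle)  # frames 1-a, 2-a, ... then 1-b, 2-b, ...
        for index, part in enumerate(partitions, start=1):
            part.run_frame()
            timeline.append(f"{index}-{label}")
    return timeline


apps = [Partition(f"Application {i}") for i in range(1, 5)]
timeline = run_major_cycles(apps, cycles=2)
print(timeline)
```

In this sketch the frame labels follow the naming used in FIG. 1A, and each application's state advances only during its own frame.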

FIG. 1B illustrates an upset event occurring during
computational frame 1-b causing the corruption of Application
1’s state variable set N. Because of time and space
partitioning, the repercussions of the upset event are limited to
affecting Application 1 because the processor will switch to
executing Application 2 at the start of computational frame 2-b.

Although FIGS. 1A and 1B illustrate time and space
partitioning with four applications, one skilled in the art upon
reading this specification would appreciate that a computer
system executing four applications is only presented as an
example and is not a limitation of the present invention.
Additionally, it would be understood by one skilled in the art
upon reading this specification that software, such as
Applications 1 to 4, executing on a computer system with time and
space partitioning can include one or more pieces of
operating system software, wherein one or more of the state
variables illustrated in FIGS. 1A and 1B pertain to the state of the
computer system itself. It would also be appreciated by one
skilled in the art upon reading this specification that
computational frames for one application, such as computational
frames 1-a, 1-b and 1-c for Application 1, are not necessarily
periodic or equal in time duration to computational frames for
another application.

Fault Detection. In one embodiment, lock-step fault detec- 
tion allows a system to detect upset events almost immedi- 
ately. One example of lock-step fault detection is provided by 
the self-checking lock-step computing lane provided in U.S. 
Pat. No. 5,909,541. 

Traditional lock step processing implies that two or more 
processors are executing the same instructions at the same 
time. Self-checking lock-step computing provides the cross 
feeding of signals from one processing lane to the other 
processing lane and then comparing them for deviations on 
every single clock edge. FIG. 2 illustrates one embodiment 
200 of a self-checking lock-step computing lane 210 of one 
embodiment of the present invention. Self-checking lock-step
computing lane 210 comprises at least two sets of duplicate
processors (212 and 214), memories (220 and 222), and
fault detection monitors (216 and 218). On every single
system clock edge, monitors 216 and 218 both compare the data
bus signal and control bus signal output of processors 212 and 
214 against each other. When the output signals fail to corre- 
late, monitors 216 and 218 identify a fault. This guarantees 
that if one processor deviates (e.g. because it retrieves a 
wrong address or is provided a wrong data bit) one or both of 
monitors 216 and 218 will detect the fault on the next clock 
edge. The fault is thus detected in the same computational 
frame in which it was generated. In one embodiment, when 
either monitor 216 or monitor 218 detects a fault, the monitor 
notifies processors 212 and 214. In embodiments of the 
present invention, upon notification of a fault, processors 212 
and 214 shut off further processing of the application which 
was executing in the faulted computational frame and the 
fault recovery system is invoked. 
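
The per-clock-edge comparison described above can be illustrated with a short Python sketch. It is offered only as a software analogy for what the specification describes as hardware monitors: the function name first_divergence and the sample bus values are hypothetical, and each list entry stands for one (data bus, control bus) sample taken on one clock edge.

```python
def first_divergence(lane_a, lane_b):
    """Return the index of the first clock edge where the lanes disagree, or None."""
    for edge, (out_a, out_b) in enumerate(zip(lane_a, lane_b)):
        if out_a != out_b:
            return edge
    return None


# Each entry is a (data_bus, control_bus) sample taken on one clock edge.
lane_212 = [(0x10, 0b01), (0x11, 0b01), (0x12, 0b10), (0x13, 0b10)]
lane_214 = [(0x10, 0b01), (0x11, 0b01), (0x16, 0b10), (0x13, 0b10)]  # upset at edge 2

fault_edge = first_divergence(lane_212, lane_214)
print(fault_edge)
```

Because the comparison is made on every sample, a deviation is flagged at the edge where it first appears rather than at the end of the frame.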

Fault Recovery. Fault detection allows the recovery tech- 
nology of the present invention to restore state variables in the 
event of an upset. The advantage for avionics systems is that 
a computer error is not propagated to the pilot level or the 
airplane motion level, but is detected quickly — within the 
computational frame in which the error occurred. State vari- 
able data is typically the type of data that changes slowly 
relative to the processing speed of the hardware platform 
calculating the state variables. By restoring state variables 
which are only a relatively few computational frames old and 
restarting the processing element, the resulting computa- 
tional results will contain only a negligible error due to the 
upset. In an embodiment where the affected application is a 
primary flight control application, aircraft response time is 
not jeopardized because the computations are restarted and 
recalculated in such a fast fashion. 

FIG. 1C illustrates the same timeline as FIG. 1B with the
addition of fault recovery as provided by embodiments of the
present invention. In one embodiment, when a fault detection
monitor, such as one of monitors 216 or 218, detects the fault
affecting Application 1 during computational frame 1-b, the
monitor notifies processors 212 and 214 to shut off processing
of Application 1, and notifies recovery control logic 232.
Meanwhile, the execution of unaffected Applications 2-4
continues during their assigned computational frames 2-b, 3-b
and 4-b. Recovery control logic 232 invokes fault recovery
which restores Application 1’s state variables as they existed
for Application 1 after the computational frame just before
the upset event occurred. In the illustration provided in FIG.
1C, the upset event occurs in computational frame 1-b,
corrupting Application 1’s state variable set N. Fault recovery
system 230 restores Application 1’s state variables N-1 into
memories 220 and 222. Execution of Application 1 then
resumes in computational frame 1-c, using the last known
uncorrupted set of state variables from frame 1-a. One
distinct advantage of this embodiment of the present invention is
that fault recovery system 230 only needs to maintain copies
of state variable data sets that are one computational frame
old.

In operation, in one embodiment, processors 212 and 214
hold state variables for applications in respective memories
220 and 222. The memory locations in memories 220 and 222
used by each application to store state variables as the
applications are executed in their respective computational frame
are referred to as “scratchpad memories”. Fault recovery
system 230 creates a duplicate copy of the state variables
stored in memories 220 and 222, creating a repository of
recent state variable data sets. Fault recovery system 230
stores off the state variables in real time, as processors 212
and 214 are executing and storing the state variables in
memories 220 and 222.

In one embodiment, as state variable values are produced 
by processors 212 and 214 and stored in memories 220 and 
222, there is a redundant copy made in duplicate memory 238. 
In one embodiment, duplicate memory 238 is contained in a 
highly isolated location to ensure the robustness of the data
stored in duplicate memory 238. In one embodiment, dupli- 
cate memory 238 is protected from corruption by one or more 
of a metal enclosure, signal buffers (such as buffers 244 and 
246) and power isolation. 

In another embodiment, the redundant copy of state
variables can be stored on a hardened memory device. As used in
this application, a hardened memory device refers to a 
memory device which is itself inherently immune to corrup- 
tion due to environmental factors. 

In addition to protecting applications against the
corruption of state variables, embodiments of the present invention
further protect against undesirable consequences from
applications that stall during their computational frame, or enter
into infinite loops. For example, in one embodiment, if an
application executing within its computational frame stalls
and never completes its frame, this fault will be detected by
one of monitors 216 or 218. In one embodiment, one of
monitors 216 or 218 then notifies recovery control logic 232
to initiate a recovery.
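
The specification does not state how a monitor recognizes a stalled frame. One possible approach, shown here purely as an assumption, is to give each frame a step budget and to treat any application that has not finished within that budget as faulted. The function completes_within and the budget value are hypothetical; the application is modeled as a Python iterator in which each item is one unit of work.

```python
import itertools


def completes_within(app_steps, budget):
    """Return True if the iterator is exhausted after at most `budget` steps."""
    # Take one step more than the budget so an overrun can be told apart
    # from an application that finishes exactly on the budget.
    taken = sum(1 for _ in itertools.islice(app_steps, budget + 1))
    return taken <= budget


healthy = iter(range(3))      # finishes after 3 steps
stalled = itertools.count()   # never finishes, like an infinite loop

healthy_ok = completes_within(healthy, budget=5)
stalled_ok = completes_within(stalled, budget=5)
print(healthy_ok, stalled_ok)
```

A False result here corresponds to the condition under which a monitor would notify the recovery control logic.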

One skilled in the art upon reading this specification would
recognize that it is undesirable to load duplicate memory 238
with state variable data in situations where the system only
partially completed a computing frame when the fault
occurred. This is because duplicate memory 238 could end up
storing corrupted data for that computing frame. Instead, to
ensure that a complete valid frame of state variable data is in
the duplicate memory and available for restoration,
embodiments of the present invention provide intermediate
memories. In one embodiment, a duplicate of memories 220 and
222 for even computational frames is loaded into even frame
memory 234. A duplicate of memories 220 and 222 for odd
computational frames is loaded into odd frame memory 236.
The even frame memory 234 and odd frame memory 236
toggle back and forth copying data into the duplicate memory
238 to ensure that a complete valid backup memory is
maintained. Even frame memory 234 and odd frame memory 236
will only copy their contents to duplicate memory 238 if the
intermediate memories themselves contain a complete valid
state variable backup for a computing frame that successfully
completes its execution.
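
The role of the even and odd intermediate memories can be sketched in Python as follows. This is a simplified software model under stated assumptions, not the disclosed hardware: the class name DoubleBufferedBackup, its method names, and the pitch_cmd variable are hypothetical, and memories are modeled as dictionaries.

```python
class DoubleBufferedBackup:
    """Even/odd intermediate copies that feed a single duplicate memory."""

    def __init__(self, initial_state):
        # At initialization all memories hold the same state variables.
        self.intermediate = [dict(initial_state), dict(initial_state)]  # even, odd
        self.duplicate = dict(initial_state)

    def capture(self, frame_number, scratchpad):
        """Copy the scratchpad into the even or odd intermediate memory."""
        self.intermediate[frame_number % 2] = dict(scratchpad)

    def commit(self, frame_number, frame_completed_ok):
        """Promote the intermediate copy only if its frame completed without a fault."""
        if frame_completed_ok:
            self.duplicate = dict(self.intermediate[frame_number % 2])

    def restore(self):
        """Return the last complete, valid set of state variables."""
        return dict(self.duplicate)


backup = DoubleBufferedBackup({"pitch_cmd": 0.0})
backup.capture(0, {"pitch_cmd": 1.5})
backup.commit(0, frame_completed_ok=True)
backup.capture(1, {"pitch_cmd": 999.0})     # frame 1 is hit by an upset
backup.commit(1, frame_completed_ok=False)  # so its copy is never promoted
restored = backup.restore()
print(restored)
```

The point of the two-stage copy in this sketch is that a partially written or corrupted frame can reach an intermediate memory but never the duplicate memory.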

In another embodiment, during the normal computer ini- 
tialization sequence of computer system 210, duplicate 
memory 238, even frame memory 234 and odd frame 
memory 236 are each adapted to copy all state variables 
written to memories 220 and 222 by processors 212 and 214 
in order to set the initial state variables saved in all memories 
to the same condition. In one embodiment, after initialization 
the alternating operation of even frame memory 234 and odd 
frame memory 236 begins as described above.

In one embodiment, fault recovery system 230 also 
includes variable identity array 242 which provides for the 
efficient use of memory storage. In one embodiment, instead 
of creating backup copies of every state variable for every 
application, variable identity array 242 identifies a subset of 
predefined state variables which allows recovery control 
logic 232 to backup only those state variables desired for 
certain applications into duplicate memory 238. In one 
embodiment, only state variables for predefined applications 
are included in the predefined subset of state variables that are 
duplicated into duplicate memory 238. In one embodiment, 
variable identity array 242 contains predefined state variable 
locations on an address-by-address basis. In one
embodiment, variable identity array 242 allows only the desired state
variable data to load into the intermediate memories. 
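
As a hedged illustration of the address-by-address selection described above, the variable identity array can be modeled as a set of addresses that gates which scratchpad locations are copied. The function name select_for_backup, the addresses, and the values are hypothetical and chosen only for the example.

```python
def select_for_backup(scratchpad, identity_addresses):
    """Keep only the scratchpad locations listed in the variable identity array."""
    return {
        address: value
        for address, value in scratchpad.items()
        if address in identity_addresses
    }


# Scratchpad contents keyed by address (illustrative values only).
scratchpad = {0x1000: 12, 0x1004: 7, 0x1008: 99}
# Addresses predefined as state variables worth backing up.
variable_identity_array = frozenset({0x1000, 0x1008})

backed_up = select_for_backup(scratchpad, variable_identity_array)
print(backed_up)
```

Only two of the three locations are duplicated in this example, which is the storage saving the paragraph above describes.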

When recovery control logic 232 is notified of a detected 
fault, recovery control logic 232 retrieves the duplicate state 
variables for an upset application from duplicate memory 238 
and restores those state variables into the upset application’s 
scratchpad memory area of memories 220 and 222. In one 
embodiment, once the duplicate state variables are restored 
into memories 220 and 222, recovery control logic 232 noti- 
fies monitors 216 and 218 and processors 212 and 214 resume 
execution of the upset application using the restored state 
variables. 

In another embodiment of the present invention, monitors 
216 and 218 are adapted to notify the faulted application of 
the occurrence of a fault, instead of notifying recovery control 
logic 232. In operation, in one embodiment, upon detection of 
a fault affecting an application, the monitor notifies proces- 
sors 212 and 214 which shut off processing of the upset 
application. On the upset application’s next processing 
frame, at least one of processors 212 and 214 notifies the
faulted application of the occurrence of the fault. In one 
embodiment, upon notification of the fault, the upset appli- 
cation is adapted to request the recovery of state variables by 
notifying recovery control logic 232. In one embodiment, 
once the duplicate state variables are restored into memories 
220 and 222, recovery control logic 232 notifies monitors 216 
and 218 and processors 212 and 214 resume execution of the 
upset application using the restored state variables. 

It would be appreciated by one skilled in the art upon 
reading this specification that the present invention is not 
limited only to embodiments with self-checking lock-step 
computing lanes. In other embodiments the recovery system 
of the present invention can be adapted to accommodate 
slower fault detection systems, which may allow several com- 
putational frames to elapse before they can identify a fault 
condition. In those circumstances, the duplicate memory is 
adapted to hold not only the state variables of the most recent
computing frame, but also the state variables for one or more
previous computing frames. In one embodiment, the recovery
system is adapted to restore the N-z backup frame state
variables to the scratchpad memory when the fault detection
system is known to take up to z computation frames to detect
a fault.
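
One way to picture the N-z restoration is a bounded history of per-frame snapshots. The following Python sketch is an interpretation, not the disclosed design: it assumes the most recent stored snapshot is frame N and that z + 1 snapshots are retained, so that the oldest retained snapshot is frame N-z. The class name SnapshotHistory and its methods are hypothetical.

```python
from collections import deque


class SnapshotHistory:
    """Retain the last z + 1 frame snapshots so frame N-z can be restored."""

    def __init__(self, z):
        self.z = z
        self.snapshots = deque(maxlen=z + 1)  # frames N-z .. N once full

    def record(self, state):
        """Store a copy of the state variables after a frame completes."""
        self.snapshots.append(dict(state))

    def restore(self):
        """Return the oldest retained snapshot (frame N-z once the history is full)."""
        if not self.snapshots:
            raise LookupError("no snapshot recorded yet")
        return dict(self.snapshots[0])


history = SnapshotHistory(z=2)
for frame in range(5):            # frames 0..4, so N = 4
    history.record({"n": frame})
restored = history.restore()      # frame N-z = 2
print(restored)
```

With a lock-step monitor the history could be a single frame deep; a slower monitor requires a correspondingly deeper history.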

FIG. 3 illustrates another embodiment of a recoverable
computer platform 300 of one embodiment of the present
invention. Computer system 350 includes one or more
processors 352, memory 356 and fault monitor 354. For state
variable values produced by processors 352 and stored in
memory 356, there is a redundant copy made in duplicate
memory 368. In one embodiment, a duplicate of memory 356
for even computational frames is loaded into even frame
memory 364 and a duplicate of memory 356 for odd
computational frames is loaded into odd frame memory 366. As
described in the embodiment for FIG. 2, even frame memory
364 and odd frame memory 366 toggle back and forth
copying data into duplicate memory 368. Thus duplicate memory
368 always contains a backup of state variables for the most
recent non-faulted computational frame. In one embodiment,
fault recovery mechanism 360 comprises one or more
duplicate memories 370, in which are maintained valid state
variable data sets for one or more computational frames previous
to the most recent computational frame. When recovery
control logic 362 is notified of a fault by monitor 354, fault
recovery mechanism 360 restores the z’th frame prior state
variable data set into memory 356, when monitor 354 is
known to take up to z computational frames to detect a fault.

In another embodiment of the present invention, one or 
more externally located fault detection monitors, such as 
monitor 319, are adapted to identify one or more faults
affecting one or more applications executing on computer system
350 and notify recovery control logic 362 to initiate a fault 
recovery as described above. In one embodiment, monitor 
319 monitors and communicates with computer system 350 
via one or more input/output ports 358. 

In one embodiment, fault recovery system 360 also
includes variable identity array 372 which provides for the
efficient use of memory storage. In one embodiment, instead
of creating backup copies of every state variable for every
application, variable identity array 372 identifies predefined
state variables which allows fault recovery mechanism 360 to
back up only those state variables desired for certain
applications. In one embodiment, variable identity array 372
contains predefined state variable locations on an
address-by-address basis. In one embodiment, variable identity array 372
allows only the desired state variable data to load into the
intermediate memories.

FIG. 4 provides a flow chart illustrating a method 400 of 
one embodiment of the present invention. The method com- 
prises duplicating state variables for a computational frame 
(410); detecting a fault within a computational frame of an
upset event (420); and recovering state variable data from a
computational frame prior to the upset event (430). In other
embodiments, the method further comprises halting the
execution of an application affected by an upset event (425)
and resuming processing after recovering state variable data
(435). When processing is restarted, the processor is able to 
resume calculations at a point very close to where the disrup- 
tion occurred. 
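
The steps of method 400 can be traced in a small Python simulation. This is a sketch under stated assumptions rather than the claimed method: the function run_with_recovery and the integral state variable are hypothetical, the fault monitor is replaced by a set of frame numbers in which an upset is injected, and the affected frame's computation is simply discarded.

```python
def run_with_recovery(inputs, upset_frames):
    """Accumulate inputs frame by frame, restoring the backup on a fault."""
    scratchpad = {"integral": 0.0}   # working state variables
    duplicate = dict(scratchpad)     # backup from the last good frame
    log = []
    for frame, sample in enumerate(inputs):
        scratchpad["integral"] += sample          # normal frame computation
        fault_detected = frame in upset_frames    # stand-in for a fault monitor (420)
        if fault_detected:
            # (425) halt the affected frame, (430) restore the prior frame's state
            scratchpad = dict(duplicate)
            log.append((frame, "recovered"))
            continue                              # (435) resume at the next frame
        duplicate = dict(scratchpad)              # (410) duplicate after a good frame
        log.append((frame, "ok"))
    return scratchpad, log


final_state, log = run_with_recovery([1.0, 2.0, 3.0, 4.0], upset_frames={2})
print(final_state, log)
```

In this run the result differs from the fault-free result only by the contribution of the single discarded frame, which illustrates why restoring slowly changing state variables that are one frame old introduces only a small error.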

Several means are available to implement the fault
recovery systems and methods of the current invention. These
means include, but are not limited to, digital computer
systems, programmable controllers, or field programmable gate
arrays. Therefore other embodiments of the present invention
are program instructions resident on computer readable
media which when implemented by such controllers, enable
the controllers to implement embodiments of the present 
invention. Computer readable media include any form of 



US 7,971,095 B2 



computer memory, including but not limited to punch cards, 
magnetic disk or tape, any optical data storage system, flash 
read only memory (ROM), non-volatile ROM, programmable
ROM (PROM), erasable-programmable ROM
(E-PROM), random access memory (RAM), or any other 
form of permanent, semi-permanent, or temporary memory 
storage system or device. Program instructions include, but 
are not limited to computer-executable instructions executed 
by computer system processors and hardware description 
languages such as Very High Speed Integrated Circuit (VHSIC)
Hardware Description Language (VHDL).

Embodiments of the present invention do not preclude 
other fault detection and recovery methods for a computer 
system from being utilized. 

Although specific embodiments have been illustrated and 
described herein, it will be appreciated by those of ordinary 
skill in the art that any arrangement, which is calculated to 
achieve the same purpose, may be substituted for the specific 
embodiment shown. This application is intended to cover any 
adaptations or variations of the present invention. Therefore, 
it is manifestly intended that this invention be limited only by 
the claims and the equivalents thereof. 

What is claimed is: 

1. A recoverable real time multi-tasking computer system 
comprising: 

a real time avionics computing platform adapted to execute
two or more avionics applications simultaneously, 
wherein each avionics application is time and space 
partitioned; 

a fault detection system adapted to detect one or more 
faults affecting the real time avionics computing plat- 
form; and 

a fault recovery system, wherein upon the detection of a 
fault by the fault detection system, the fault recovery 
system is adapted to restore a duplicate set of state vari- 
ables, wherein the fault recovery system is further 
adapted to: 

store, duplicate, and recover only selected state variables 
from one or more frame times; and 
recover state variables pertaining to any one or more of 
the avionics applications simultaneously; 

wherein the fault recovery system operates without any 
involvement from the avionics applications, and 

wherein when a recovery of the one or more avionics 
applications occurs, the other avionics applications con- 
tinue to operate without disturbance. 

2. The system of claim 1, wherein each application of the 
two or more avionics applications is executed by the real time 
avionics computing platform during one or more computa- 
tional frames, wherein the fault detection system is further 
adapted to detect the one or more faults. 

3. The system of claim 2, wherein the fault recovery system 
is further adapted to restore the duplicate set of state variables 
from a computational frame occurring more than one frame 
before the computational frame in which the fault occurred. 

4. The system of claim 3, wherein the fault recovery system 
is further adapted to restore a duplicate set of state variables 
from a computational frame occurring one computational 
frame before the computational frame in which the fault 
occurred. 

5. The system of claim 1, wherein the fault recovery system 
further comprises: 

a first duplicate memory; 

an even frame memory, wherein the fault recovery system 
is adapted to duplicate state variables computed by the 
real time avionics computing platform during even com- 
putational frames into the even frame memory; and 
an odd frame memory, wherein the fault recovery system is
adapted to duplicate state variables computed by the real 
time avionics computing platform during odd computa- 
tional frames into the odd frame memory; 

wherein the even frame memory and odd frame memory
toggle back and forth duplicating state variables into the 
first duplicate memory for computational frames in 
which no fault was detected by the fault detection sys- 
tem. 

6. The system of claim 5, wherein the even frame memory
and odd frame memory are further adapted to not duplicate
into the first duplicate memory state variables for computational
frames in which a fault was detected by the fault detection
system.

7. The system of claim 5, wherein the first duplicate 
memory, the even frame memory and the odd frame memory 
are further adapted to duplicate state variables computed by 
the real time avionics computing platform during initialization
of the real time avionics computing platform.

8. The system of claim 5, further comprising: 

a second duplicate memory, wherein the fault recovery 
system stores duplicate sets of state variables for a plu- 
rality of computational frames. 

9. The system of claim 5, wherein the first duplicate
memory is protected from corruption due to environmental
factors by one or more of shielding from a metal enclosure, 
signal buffers, isolated power supplies and hardened memory. 

10. The system of claim 1, the fault recovery system further 
comprising:

a variable identity array, adapted to identify a predefined 
subset of state variables, wherein the fault recovery sys- 
tem duplicates only the subset of state variables. 

11. A recoverable real time multi-tasking computer system
comprising:

two or more avionics applications; 

an avionics computing platform comprising one or more 
processors, the one or more processors executing the 
two or more avionics applications simultaneously, 
wherein each application is time and space partitioned;

one or more scratchpad memories, wherein the one or more 
processors store state variables for the two or more avi- 
onics applications in the one or more scratchpad memo- 
ries; 

one or more fault monitors, the one or more fault monitors
adapted to detect one or more system faults occurring 
during the execution of a first application of the two or 
more avionics applications; and 

a fault recovery system adapted to duplicate state variables 
stored in the one or more scratchpad memories, wherein
the fault recovery system is further adapted to: 
store, duplicate, and recover only selected state variables 
from one or more frame times; and 
recover state variables pertaining to any one or more of 
the avionics applications simultaneously;

wherein the fault recovery system operates without any 
involvement from the avionics applications, 

wherein upon the detection of a fault, the fault recovery 
system is further adapted to restore a duplicate set of 
state variables into the one or more scratchpad memories,

wherein the one or more processors are adapted to resume 
processing of the first application using the duplicate set 
of state variables, and 

wherein when a recovery of the one or more avionics
applications occurs, the other avionics applications con- 
tinue to operate without disturbance. 
12. The system of claim 11, wherein upon the detection of
the fault, the one or more fault monitors are further adapted to 
notify the fault recovery system. 

13. The system of claim 11, wherein upon the detection of 
the fault, the one or more fault monitors are further adapted to
notify a first application affected by the fault, wherein the first 
application is adapted to notify the fault recovery system. 

14. The system of claim 11, wherein each application of the
two or more avionics applications is executed by the one or
more processors during one or more computational frames,
wherein the one or more fault monitors are further adapted to 
detect one or more system faults within the computational 
frame in which the fault occurred. 

15. The system of claim 14, the fault recovery system
further comprising:

a first duplicate memory; 

an even frame memory, wherein the fault recovery system 
is adapted to duplicate state variables stored in the one or 
more scratchpad memories during even computational 
frames into the even frame memory; and

an odd frame memory, wherein the fault recovery system is 
adapted to duplicate state variables stored in the one or 
more scratchpad memories during odd computational 
frames into the odd frame memory; 
wherein the even frame memory and odd frame memory
toggle back and forth duplicating state variables into the 
first duplicate memory for computational frames in 
which no fault was detected by the one or more fault 
monitors. 

16. The system of claim 15, further comprising:

a second duplicate memory, wherein the fault recovery
system stores duplicate sets of state variables for a plurality
of computational frames.

17. A method for fault recovery, the method comprising: 
executing a plurality of avionics applications simultaneously
on a real time multi-tasking avionics computer
system wherein each avionics application is time and 
space partitioned; 

duplicating state variables for one or more computational 
frames;

detecting a fault from an upset event within the computa- 
tional frame of one of the applications in which the upset 
event occurred; 

recovering state variable data duplicated during a computational
frame prior to the upset event; and

restoring the duplicated state variable data to a computa- 
tional frame of the one of the applications that occurs 
immediately after the computational frame in which the 
upset event occurred, wherein the duplicated state variable
data is restored without any involvement from the
avionics applications, and wherein during recovery of 
the one of the applications, the other applications con- 
tinue to operate without disturbance. 

18. The method of claim 17, further comprising: 

halting the execution of an application affected by the upset
event; and 

resuming processing the application affected by the upset 
event after recovering state variable data. 

19. The method of claim 17, wherein duplicating state 
variables for one or more computational frames further comprises:

duplicating state variables from an even computational 
frame into a first memory; 

duplicating state variables from an odd computational 
frame into a second memory; and

alternately duplicating state variables from the first 
memory and the second memory into a third memory. 
20. The method of claim 19, wherein recovering state variable
data from the computational frame duplicated prior to
the upset event further comprises: 

duplicating state variables from the third memory into one 
or more scratchpad memories. 

21. A computer-readable medium having program instruc- 
tions for a method for fault recovery, the method comprising: 

executing a plurality of avionics applications simulta- 
neously on a real time multi-tasking avionics computer 
system wherein each avionics application is time and 
space partitioned; 

duplicating state variables for one or more computational 
frames; 

detecting a fault from an upset event within the computa- 
tional frame of one of the applications in which the upset 
event occurred; 

recovering state variable data duplicated during a compu- 
tational frame prior to the upset event; and 

restoring the duplicated state variable data to a computa- 
tional frame of the one of the applications that occurs 
immediately after the computational frame in which the 
upset event occurred, wherein the duplicated state vari- 
able data is restored without any involvement from the 
avionics applications, and wherein during recovery of 
the one of the applications, the other applications con- 
tinue to operate without disturbance. 

22. The computer-readable medium of claim 21, the 
method further comprising: 

halting the execution of an application affected by the upset 
event; and 

resuming processing the application affected by the upset 
event after recovering state variable data. 

23. The computer-readable medium of claim 21, wherein 
duplicating state variables for one or more computational 
frames further comprises: 

duplicating state variables from an even computational 
frame into a first memory; 

duplicating state variables from an odd computational 
frame into a second memory; and 

alternately duplicating state variables from the first 
memory and the second memory into a third memory. 

24. The computer-readable medium of claim 23, wherein 
recovering state variable data from the computational frame 
duplicated prior to the upset event further comprises: 

duplicating state variables from the third memory into one 
or more scratchpad memories. 

25. A system comprising: 

a self-checking lock-step avionics lane including two or 
more processors; 

two or more scratchpad memories and two or more fault 
monitors, the self-checking lock-step avionics lane 
adapted to execute two or more avionics applications 
simultaneously, wherein each application is time and 
space partitioned, wherein each application of the two or 
more avionics applications is executed by the two or 
more processors during one or more computational 
frames, wherein the two or more fault monitors are fur- 
ther adapted to detect one or more system faults within 
the computational frame in which the fault occurred; 

a rapid recovery mechanism comprising: 

a first duplicate memory adapted to store state variables 
duplicated from the two or more scratchpad memo- 
ries; and 

a recovery control logic module adapted to receive fault 
detection signals from the two or more fault monitors; 

wherein the rapid recovery mechanism is further adapted 
to: 
store, duplicate, and recover only selected state variables
from one or more frame times; and 
recover state variables pertaining to any one or more of 
the avionics applications simultaneously; 
wherein the rapid recovery mechanism operates without 
any involvement from the avionics applications, 
wherein upon the detection of a fault, the recovery control 
logic module is adapted to restore a duplicate set of state 
variables into the two or more scratchpad memories, and 
wherein when a recovery of the one or more avionics 
applications occurs, the other avionics applications con- 
tinue to operate without disturbance. 

26. The system of claim 25, wherein the recovery control 
logic module is further adapted to restore a duplicate set of 
state variables from a computational frame occurring more 
than one frame before the computational frame in which the 
fault occurred. 

27. The system of claim 26, the rapid recovery mechanism 
further comprising: 

an even frame memory adapted to duplicate state variables 
stored in the two or more scratchpad memories during 
even computational frames into the even frame memory; 
and 

an odd frame memory adapted to duplicate state variables 
stored in the two or more scratchpad memories during 
odd computational frames into the odd frame memory; 
wherein the even frame memory and odd frame memory 
toggle back and forth duplicating state variables into the 
first duplicate memory for computational frames in 
which no fault was detected by the two or more fault 
monitors. 

28. The system of claim 27, wherein the even frame 
memory and odd frame memory are further adapted to dis- 
card state variables for computational frames in which a fault 
was detected by the one or more fault monitors. 
29. A recoverable real time multi-tasking computer system
comprising: 

means for executing two or more time and space parti- 
tioned avionics applications simultaneously; 

means for detecting one or more faults affecting at least one
of the two or more time and space partitioned avionics
applications; and 

means for restoring a duplicate set of selected state variables
upon the detection of a fault affecting the at least
one of the two or more time and space partitioned avionics
applications;

wherein the means for restoring operates without any 
involvement from the avionics applications, and 
wherein when a recovery of the one or more avionics 
applications occurs, the other avionics applications continue
to operate without disturbance.

30. The system of claim 29, wherein the means for restor- 
ing a duplicate set of state variables further comprises: 

a first means for storing state variables; 

a second means for storing state variables adapted to duplicate
state variables computed during even computational
frames; and

a third means for storing state variables adapted to duplicate
state variables computed during odd computational
frames;

wherein the second means for storing state variables and 
the third means for storing state variables toggle back 
and forth duplicating state variables into the first means 
for storing state variables for computational frames in 
which no fault was detected;

wherein the means for restoring a duplicate set of state 
variables is further adapted to restore the state variables 
from the first means for storing state variables. 

31. The system of claim 5, wherein the first duplicate
memory, the even frame memory, and the odd frame memory
are adapted to duplicate state variables computed by a real
time operating system of the real time avionics computing 
platform. 




